
TOGSim C++ trace-generation pipeline (P0-P3): explicit dataflow producer + barriers #267

Open
YWHyuk wants to merge 28 commits into develop from feature/togsim-cpp-trace


YWHyuk (Collaborator) commented Jun 19, 2026

What

Replaces the timing-path TOG producer (MLIR -> Python dict -> ONNX -> C++ TileGraphParser) with a compiled, shape-parametric trace producer: post-vcix MLIR -> skeleton -> EmitC -> C++ -> .so. TOGSim dlopens the .so, runs it to record an instruction trace, and feeds it into the existing Simulator/Core (timing core unchanged). Driven by a new --trace_so mode; the legacy ONNX-TOG path is kept and marked DEPRECATED, so nothing existing breaks.

Pipeline

post-vcix .mlir
  | build_skeleton.py        loops + memref.dma_start/wait -> togsim.* ; DCE the rest
  | dep_analysis.py          per-op read/write SRAM buffers (SSA) + vcix preload/matmul pairing
  | lower_to_emitc.py        togsim.* -> emitc.call_opaque ; drive upstream convert-*-to-emitc
  v
EmitC --mlir-translate--> C++ --g++ -shared--> trace.so
  | run_producer (dlopen)    EmitCtx callbacks record a TraceRec stream
  | togsim_trace_bridge.cc   TraceRec -> TileGraph (explicit dependency DAG)
  v
existing Simulator / Core    cycles, DRAM traffic

Dependency model (no in-order, no runtime tag-hash, no op heuristics)

Dependencies are derived from two sources available pre-collapse:

  • SRAM last-writer per buffer (load->compute, the Y_spad accumulator chain), recovered via SSA + a virtual SA_WEIGHTS buffer that folds preload->matmul.
  • The systolic array modeled as a pipeline (occupancy/latency split) with two explicit, distinctly-named barriers:
    • MEMORY_BAR (renamed from BAR): the DMA/tag memory fence; a compute that consumes an async load waits until the load's data response completes.
    • COMPUTE_BAR (new): the compute fence; a store waits for all systolic-array pipelines to drain.

Both barriers are first-class trace ops (togsim.compute_barrier -> ABI togsim_compute_barrier) visible in the trace dump and the instruction stream.
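
A minimal sketch of the last-writer rule, written in Python for illustration only (the real bridge is C++). The Op class, link_last_writers, and the buffer names A_spad and B_spad are hypothetical; only Y_spad and SA_WEIGHTS come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """One trace op with the SRAM buffers it reads and writes (illustrative)."""
    name: str
    reads: tuple = ()
    writes: tuple = ()
    deps: list = field(default_factory=list)


def link_last_writers(ops):
    """Give each op an edge from the last writer of every buffer it reads."""
    last_writer = {}  # buffer name -> index of the op that last wrote it
    for idx, op in enumerate(ops):
        for buf in op.reads:
            writer = last_writer.get(buf)
            if writer is not None and writer not in op.deps:
                op.deps.append(writer)
        for buf in op.writes:
            last_writer[buf] = idx
    return ops


ops = link_last_writers([
    Op("load_A", writes=("A_spad",)),
    Op("load_B", writes=("B_spad",)),
    # The virtual SA_WEIGHTS buffer folds the preload -> matmul pairing.
    Op("preload", reads=("B_spad",), writes=("SA_WEIGHTS",)),
    Op("matmul", reads=("A_spad", "SA_WEIGHTS", "Y_spad"), writes=("Y_spad",)),
    Op("store", reads=("Y_spad",)),
])
```

In this toy trace the matmul depends on load_A and on the preload (through SA_WEIGHTS), and the store depends on the matmul through Y_spad. The barriers are left out to keep the sketch short.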

Status

  • 256^3 GEMM runs end-to-end through the real Simulator via --trace_so.
  • Cycle comparison vs the legacy build_tog path on the same kernel + gem5 cycle_list: compute work and DRAM traffic match; matmuls pipeline on 2 SAs; the memory fence correctly delays compute until the weight load arrives.
  • Known open items (documented in docs/design/togsim_cpp_trace.md sec 10): preload-concurrency cap (needs non-zero preload occupancy), parallel output tiles (dispatch granularity), broader op coverage (conv/SDPA/vector).

Testing

  • tests/test_togsim_skeleton.py, test_togsim_emitc.py, test_togsim_runtime.py (7 tests).
  • Manual --trace_so GEMM through TOGSim.
  • Legacy path untouched (comment-only DEPRECATED markers).

Design of record: docs/design/togsim_cpp_trace.md (sec 9-10).

🤖 Generated with Claude Code

YWHyuk force-pushed the feature/togsim-cpp-trace branch 2 times, most recently from cc507fd to f5e8e55 on June 19, 2026 08:12
YWHyuk and others added 23 commits June 22, 2026 21:13
Design-of-record + status + handoff for the C++ trace producer: post-vcix
MLIR -> skeleton+API -> EmitC -> compiled .so that TOGSim dlopens and feeds
to the existing timing Core. Async DMAs pair with explicit memory barriers
by the runtime tag slot (tag_id, tag_slot) via the Core tag table; the
SRAM-buffer last-writer DAG carries compute dependencies. Validated on the
256^3 GEMM: trace 2518 vs legacy 2698 cycles on the real gem5 cycle table.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
One op-walk generator and the one-line attribute builders/readers were
copied across the passes. Consolidate into passes/_mlir_util.py
(walk_ops; i32/i64/i64_array/str_attr; attr_int/attr_bool/attr_i64_array)
and adopt it in lower_to_vcix, decompose_transfer, dma_fine_grained,
lower_dma_to_gemmini, lower_vlane_idx. walk_ops needs no MLIR bindings so
the module imports mlir.ir lazily; pure functions, no module-global state.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
The compiler half of the trace pipeline. build_skeleton (C2) reduces a
post-vcix kernel to a loop skeleton + togsim.* API ops: dma_start ->
togsim.dma (tag_id + runtime tag index), dma_wait -> explicit
togsim.memory_barrier, compute node -> togsim.compute, then a use-based DCE
strips the data math. dep_analysis derives per-op SRAM read/write buffers
(the last-writer dependency DAG); cycle_table builds the tile_id->cycle
sidecar; lower_to_emitc (C4) rewrites togsim.* to emitc.call_opaque and
drives the upstream EmitC pipeline to C++. extension_codecache emits the
.so + cycle sidecar opt-in (TORCHSIM_DUMP_TRACE_SO=1), snapshotting the
gem5 cycle_list before the legacy TOG consumes it. tog_generator marked
DEPRECATED. No static event_id: an async dma pairs with its barrier by the
runtime tag slot, since one static op runs once per loop iteration.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
TOGSim side of the trace pipeline. togsim_runtime.{h,cc} is the producer
ABI (v11): togsim_dma (void, carries tag_id + tag_slot), togsim_compute,
togsim_memory_barrier (the explicit async-DMA sync), togsim_compute_barrier,
togsim_core_alloc. togsim_loader records a TraceRec stream; the bridge
(togsim_trace_bridge) turns it into a TileGraph: an async dma and its
memory_barrier pair by (tag_id, tag_slot) through the Core tag table
(set_tag_finish / register_tag_waiter), the barrier becomes the last-writer
of the loaded buffer, and the SRAM read/write-buffer DAG drives compute
deps with the occupancy/latency systolic-array pipeline + an explicit
compute fence before a store. main.cc gains --trace_so/--cycle_table;
Instruction/Core gain MEMORY_BAR + COMPUTE_BAR and the pipeline-child model.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
test_togsim_skeleton pins the togsim_ops vocabulary against the ABI header
and exercises build_skeleton on a post-vcix fixture (event-id-free output,
explicit memory_barrier). test_togsim_emitc builds the .so and checks the
EmitC/symbol-table shape + that it runs against a stub runtime. The
togsim_runtime test links the real runtime, runs the loader, and checks the
recorded trace (resolved addresses, tag-paired barriers, looked-up cycles).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
The .so's exported entry function (the kernel skeleton that the loader
dlopens and runs) is renamed from togsim_emit to togsim_kernel. Pure rename of
the single ENTRY_SYMBOL contract (producer export == loader dlsym); no
signature or behavior change. Updated togsim_ops.ENTRY_SYMBOL, the runtime
header/loader, lower_to_emitc, the tests' dlsym/nm checks, and the design
docs. Left togsim_emitc (the C4 lowering / its test) untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
…ag alloc)

The trace bridge's dma tag key has an empty accum component, so it pairs
correctly only for a single-tile reduction (the current GEMM). Document the
agreed fix for multi-tile-K and conv: hoist the tag memref alloc into the
reduction-loop body (coarse, pre-fine-grained DMA) so each reduction
iteration gets a fresh tag whose runtime identity is the per-iteration
tag_id -- no accum-axis enumeration, works for any reduction depth. Because
that alloc dominates both the load and wait nests, dma and memory_barrier
pair by the SSA tag handle, with tag_idx kept as the subtile slot. Comment
only; no behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
…d tag key)

The bridge keyed the Core tag table on the static (tag_id, tag_slot), so the
DMAs of successive reduction iterations of one static op shared a key and would
collide for multi-tile-K (and conv, reduction = kh*kw*C). Mint a fresh per-DMA-
record tag key (uniq) instead, and pair each memory_barrier with the CURRENT
load for its (tag_id, tag_slot) -- it is 1 load : N barriers (the load runs once
per reduction iteration; each consumer waits on the same tag), and the load/consumer
nests run in order within the reduction body, so "current load" is correct (not a
FIFO). Distinct uniq per load => successive iterations never collide; axis-
agnostic, no coordinate enumeration. Single-tile GEMM is unchanged (2518 cycles).
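
The "current load" pairing can be sketched as follows. This is a Python illustration under stated assumptions, not the bridge's C++ code: pair_barriers and the tuple-shaped records are hypothetical, and only the (tag_id, tag_slot) key and the per-record uniq idea come from the message above.

```python
from itertools import count


def pair_barriers(records):
    """Pair each memory barrier with the current load for its (tag_id, tag_slot).

    `records` is a list of ("dma" | "memory_barrier", tag_id, tag_slot) tuples
    in trace order. Returns one (record_index, kind, uniq) tuple per record.
    """
    fresh = count()   # source of fresh per-DMA-record keys
    current = {}      # (tag_id, tag_slot) -> uniq of the most recent load
    paired = []
    for idx, (kind, tag_id, tag_slot) in enumerate(records):
        key = (tag_id, tag_slot)
        if kind == "dma":
            current[key] = next(fresh)   # a new load replaces the previous one
        elif key not in current:
            raise KeyError(f"barrier {key} has no preceding load")
        paired.append((idx, kind, current[key]))
    return paired


trace = [
    ("dma", 0, 0), ("memory_barrier", 0, 0), ("memory_barrier", 0, 0),  # iteration 0
    ("dma", 0, 0), ("memory_barrier", 0, 0),                            # iteration 1
]
uniqs = [u for _, _, u in pair_barriers(trace)]
```

Both barriers of iteration 0 resolve to the first load's key and the barrier of iteration 1 resolves to the second, even though all five records share the same static (tag_id, tag_slot).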

FIXME kept: the per-iteration tag is reconstructed here from record order, while
the producer IR still carries one static func-entry tag alloc -- the faithful fix
is to hoist that memref.alloc into the reduction-loop body and emit a matching
per-iteration togsim.tag_alloc threaded by SSA (then uniq is unnecessary).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
…conv)

A tag memref was allocated once at the func entry and reused by every reduction
iteration of a static DMA, so the per-iteration tag identity was only an
artifact of the timing path's bridge. Make it real in the IR: when fine-grained
splits a matmul load, allocate a fresh tag memref.alloc just before the coarse
dma_start and replace_all_uses_with the old tag -- this rewires both the
re-emitted dma_start AND its dma_wait, and the coarse dma sits at the reduction-
loop body level so the alloc dominates the load and wait nests. Each reduction
iteration thus allocates its own tag (distinct for multi-tile-K / conv, no
coordinate enumeration); the now-dead func-entry alloc is erased. Sync stores
keep their tag.

Legacy materializes to a distinct alloc per iteration (its calc_tag accum
component becomes redundant); verified the 256^3 GEMM still passes and the trace
path is unchanged at 2518 cycles. The bridge FIXME is updated: build_skeleton
still collapses the in-loop alloc to one static tag_id, so the bridge's per-record
uniq is still what distinguishes iterations until that identity is threaded as an
SSA tag handle.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
… slot

build_skeleton carried the dma_wait tag index verbatim onto togsim.memory_-
barrier. lower_to_vcix builds that index with a -acc_iv term for each
accumulation (reduction) loop var -- a sentinel marking the reduction axis, not
an arithmetic offset (legacy TileGraphParser skips stride -1 for the same
reason). The matching async load index (dma_fine_grained) is subtile-only, so at
reduction iteration > 0 the producer evaluated -acc_iv to a negative slot, the
recorded barrier tag_slot diverged from the load slot, and TOGSim aborted with
"Key does not exist in subgraph's tag table" on subtile + multi-tile-K.

_strip_accum_terms now drops the negative-coefficient dim terms from the wait's
affine.apply (composing with a selector that zeros those dims), so the barrier
slot is subtile-only and pairs with its load. Reduction iterations are still
told apart by the per-iteration tag alloc and the fresh per-record Core key in
the bridge, not by the slot. Single-tile kernels (no reduction term) fall
through unchanged.
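
A simplified model of the slot mismatch and the fix, assuming the index is a plain linear expression held as a dict of coefficients. The real _strip_accum_terms operates on an affine.apply; the functions and loop-variable names here (sub_m, sub_n, acc_k) are hypothetical.

```python
def strip_accum_terms(coeffs):
    """Drop negative-coefficient dim terms from a linear index expression.

    `coeffs` maps a loop-variable name to its integer coefficient.
    """
    return {dim: c for dim, c in coeffs.items() if c >= 0}


def evaluate(coeffs, values, const=0):
    """Evaluate sum(coeff * value) + const for the given loop-variable values."""
    return const + sum(c * values[dim] for dim, c in coeffs.items())


wait_index = {"sub_m": 2, "sub_n": 1, "acc_k": -1}  # -acc_k marks the reduction axis
load_index = {"sub_m": 2, "sub_n": 1}               # the load slot is subtile-only
at = {"sub_m": 1, "sub_n": 0, "acc_k": 3}           # reduction iteration 3

raw_slot = evaluate(wait_index, at)                         # diverges from the load
barrier_slot = evaluate(strip_accum_terms(wait_index), at)  # matches the load
load_slot = evaluate(load_index, at)
```

With the sentinel term left in, the barrier slot evaluates to a negative value at reduction iteration 3; once it is stripped, the barrier and load slots agree.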

Verified: 256x512x256 forced to 128x128 subtiles (2 K-tiles) now runs to 5774
cycles instead of crashing; single-tile 256^3 unchanged. Adds a self-contained
regression for _strip_accum_terms.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
Document that the trace tag_slot is subtile-only: build_skeleton strips the
lower_to_vcix -acc_iv accumulation marker from the dma_wait index so a
memory_barrier pairs with the slot its load wrote, mirroring legacy
TileGraphParser's skip of stride -1. Record that subtile + multi-tile-K
(256x512x256, 128x128 subtiles, 2 K-tiles) now runs at 5774 cycles.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
The -1 coefficients lower_to_vcix puts on accumulation loop vars in the A/B
dma_wait tag indices are a reduction-axis sentinel honored only by the legacy
TOG path (TileGraphParser); the trace path strips them in
build_skeleton._strip_accum_terms. Document this at both emission sites and note
they are kept for byte-identity with the C++ -test-pytorchsim-to-vcix pass and
should be removed (not flagged) once legacy retires. Comments only; output is
unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
…_END

Replace the bare togsim_core_alloc marker with a higher-order
togsim_dispatch(ctx, tile_fn, iv, n_iv) wrapper. The runtime round-robins a core
from the pool, brackets the work-item with TILE_BEGIN/TILE_END trace records, and
invokes the producer's outlined tile function. The work-item scope is now exactly
the function call, not an implicit "ops until the next core_alloc" range, and one
general (kernel-independent) dispatcher serves every kernel via a uniform
iv-array tile signature (togsim_tile_fn). Core alloc and the begin/end boundary
are runtime-owned; the producer stays core-count transparent.
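
As a rough Python sketch of the dispatch wrapper described above (the real runtime is C++ and its signature is togsim_dispatch(ctx, tile_fn, iv, n_iv)): the Runtime class, its compute method, and gemm_tile are hypothetical stand-ins.

```python
from itertools import cycle


class Runtime:
    """Round-robins work-items over a core pool and brackets each with markers."""

    def __init__(self, num_cores):
        self._cores = cycle(range(num_cores))
        self._core = None
        self.trace = []   # recorded (kind, core) tuples

    def dispatch(self, tile_fn, iv):
        self._core = next(self._cores)       # core choice is runtime-owned
        self.trace.append(("TILE_BEGIN", self._core))
        tile_fn(self, iv, len(iv))           # the work-item scope is exactly this call
        self.trace.append(("TILE_END", self._core))

    def compute(self, name):
        self.trace.append((name, self._core))


def gemm_tile(ctx, iv, n_iv):
    """Hypothetical outlined tile body with the uniform iv-array signature."""
    m, n = iv[:n_iv]
    ctx.compute(f"compute[{m},{n}]")


rt = Runtime(num_cores=2)
for m in range(2):
    for n in range(2):
        rt.dispatch(gemm_tile, [m, n])

begin_cores = [core for kind, core in rt.trace if kind == "TILE_BEGIN"]
```

The tile function never sees a core id, which mirrors the point that the producer stays core-count transparent.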

TraceRec gains TILE_BEGIN/TILE_END (drops DISPATCH); the bridge opens a subgraph
on TILE_BEGIN (bound to the record's core) and flushes it on TILE_END, and the
reference timer treats both as zero-cost boundaries. Verified on the subtile
256x512x256 case: 5774 cycles, identical to the pre-outline core_alloc form.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
…spatch

lower_to_emitc now outlines the innermost parallel-loop body into a uniform
togsim_kernel_tile(ctx, iv, n) func and replaces it with a
togsim_dispatch(ctx, togsim_kernel_tile, iv, n) call, instead of inserting a bare
togsim_core_alloc marker inline. The dispatcher loop marshals the parallel
induction vars (m, n) into an int64 array and passes the tile fn as a verbatim
function pointer (#emitc.opaque), so the work-item scope is the tile function body
and the runtime wrapper owns the core-alloc + TILE_BEGIN/TILE_END boundary.

The outline runs after the togsim.* ops become emitc.call_opaque: it moves the
body ops into the tile fn, recovers each parallel index as index_cast(iv[k])
inside it, and remaps the captured ctx / induction vars / constants (Value == is
identity; external constants are cloned). Only ctx, the parallel IVs, and
constants may be captured (dynamic-shape captures raise -> P4). mlir-to-cpp
renders a static togsim_kernel_tile defined before the extern "C" togsim_kernel
dispatcher. togsim_ops gains DISPATCH_CALLEE / TILE_SYMBOL (drops
CORE_ALLOC_CALLEE).

Tests: the emitc/runtime harnesses define togsim_dispatch (calling the tile fn)
and the skeleton/emitc contract checks use DISPATCH_CALLEE; the outlined .so
builds, dlopens, and runs. Docs updated (outline DONE, ABI v12).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
…e path

Model the per-core VMEM/spad as a finite resource so the trace path does not
prefetch unboundedly. A load occupies its tile's footprint when it issues and the
buffer-version it fills is freed when its last consumer issues (tag-last: the
bridge tags only each version's last reader). A load that would overflow the spad
does not issue that cycle -- it retries until a consumer frees a tile.

- Instruction: per-load buffer-version id + footprint (from tile_numel*elem_bits);
  per-consumer list of versions it frees on issue.
- togsim_trace_bridge: group the fine DMAs that fill a coarse tile into one
  buffer-version (a read closes it -> the next write is a new version), tag the
  last reader to free it. Tracked buffers are the DMA-loaded ones; the accumulator
  / virtual SA-weights are never load-written, so they are not charged. The pool
  persists across work-items (one physical per-core spad).
- Core: per-core sram_used / sram_capacity (= core_spad_size_kb) + a version->bytes
  map; gate MOVIN issue on free space; release on COMP/MOVOUT issue.
- Simulator::check_frozen: if work remains (running()) but nothing is in flight,
  the spad is too small to hold a kernel's working set -- error out after a margin
  (kWedgeThreshold) instead of looping forever.
- core_spad_size_kb config key (default 0 = unset/unlimited). Only trace-path
  instructions are gated; legacy TileGraphParser insts keep alloc id -1.

Verified: 1024^3 GEMM unchanged at 16 MB (compute-bound); shrinking the spad
throttles loads and below one tile-pair the run reports "spad too small" rather
than deadlocking.
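
The gating rule can be illustrated with a small Python model. It is a sketch under assumptions, not the Core implementation: the Spad class and its method names are hypothetical, and it tracks only bytes per buffer-version.

```python
class Spad:
    """Finite per-core scratchpad: loads occupy bytes, last readers free them."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes   # 0 means unset, i.e. unlimited
        self.used = 0
        self.versions = {}               # buffer-version id -> bytes held

    def try_issue_load(self, version, nbytes):
        """Return True and charge the spad if the load fits; else it must retry."""
        if self.capacity and self.used + nbytes > self.capacity:
            return False
        self.versions[version] = nbytes
        self.used += nbytes
        return True

    def issue_consumer(self, frees):
        """A consumer releases the versions it was tagged as the last reader of."""
        for version in frees:
            self.used -= self.versions.pop(version)


spad = Spad(capacity_bytes=2 * 1024)
first = spad.try_issue_load(version=0, nbytes=1024)
second = spad.try_issue_load(version=1, nbytes=1024)
blocked = spad.try_issue_load(version=2, nbytes=1024)   # spad is full
spad.issue_consumer(frees=[0])                          # last reader of version 0
retried = spad.try_issue_load(version=2, nbytes=1024)
```

The third load is refused while both tiles are resident and succeeds on retry once the last reader of version 0 has issued.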

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
core_spad_size_kb drives the trace-path SRAM throttle; provide it from the config
rather than a hardcoded default. TPUv2/v3/v4 VMEM = 16 MB (16384); the 8x8 toy
arrays = 128 KB x 8 = 1 MB (1024). stonne/heterogeneous configs are left unset
(different accelerator path).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
Convert a `--log_level trace` log into a Chrome Trace Event JSON (open in
https://ui.perfetto.dev). Per-core lanes dma / sa / vector; each hardware unit is
replayed as a server so real idle gaps show and slices do not nest.
- sa/vector: slice width = compute_cycle - overlapping_cycle (occupancy, tail
  excluded); --num-sa N splits the SA into sa0..saN-1.
- dma: slice = the request-injection window [INST_ISSUED, ASYNC_DMA_ISSUE]. When a
  load is blocked from issuing (spad full under the SRAM throttle) its INST_ISSUED
  is delayed past the engine-free time, so the stall shows as a real idle gap on
  the dma lane (vs. continuous injection when the spad is large).
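
For orientation, a slice in the Chrome Trace Event format might be built like this. The helper names and the mapping of one cycle to one time unit are assumptions for illustration; trace_timeline.py itself may differ.

```python
import json


def lane_name(core, tid, name):
    """Metadata event that labels lane `tid` of process `core`."""
    return {"name": "thread_name", "ph": "M", "pid": core, "tid": tid,
            "args": {"name": name}}


def sa_slice(name, core, tid, start_cycle, compute_cycle, overlapping_cycle):
    """Complete event ("ph": "X") whose width is occupancy, tail excluded."""
    return {"name": name, "ph": "X", "pid": core, "tid": tid,
            "ts": start_cycle,   # assumption: one cycle is drawn as one time unit
            "dur": compute_cycle - overlapping_cycle}


events = [
    lane_name(core=0, tid=1, name="sa0"),
    sa_slice("matmul", core=0, tid=1, start_cycle=100,
             compute_cycle=64, overlapping_cycle=16),
]
trace_json = json.dumps({"traceEvents": events})
```

Writing trace_json to a file gives something the Perfetto UI can open, with one process row per core and one named lane per hardware unit.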

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
Model a systolic array's weight registers as a finite resource so a preload
cannot run unboundedly ahead of the matmuls that consume its weight, and pin each
matmul to the SA its weight was preloaded into (weight-stationary locality).
Previously every COMP was round-robined independently, so preloads batched far
ahead (16-deep) and freed the B-spad they read too early.

A preload acquires a weight slot on a systolic array that has a free one
(round-robin among free); if all are full it does not issue that cycle and
retries. It pins its matmul consumers (its pipeline children) to that SA and
gives them a shared token. A matmul frees its slot when it is done READING the
weight -- the streaming phase, finish_cycle - overlapping_cycle -- not at full
finish: the drain tail flushes results without touching the weight, so releasing
at finish would hold the slot through the tail and stall the next double-buffered
preload (a visible SA bubble, ~2% inflated cycles). The release is scheduled at
issue into a per-core cycle-keyed queue drained before dispatch; the last
consumer frees the slot.

A preload with no matmul consumers is left alone, so paths without preload->matmul
occupancy edges (legacy TileGraphParser uses only add_child) keep unbounded
round-robin / infinite weights.

- Instruction: WeightToken {sa, refcount} + per-op _assigned_sa.
- Core: per-SA weight-slot counts (cap = sa_weight_buffer_depth); pick_free_weight_sa
  for the preload gate + SA choice; matmul runs on its weight's SA (rr fallback);
  _weight_release_q + process_weight_releases() for the streaming-end release.
- SimulationConfig/Common: sa_weight_buffer_depth (default 2 = weight double-buffer,
  a convention/tunable, not a verified per-gen constant).

1024^3 GEMM is compute-bound and unchanged (48184 == unbounded baseline), with
preloads now paced 1:8 with matmuls and SA lanes balanced 288/288; spad/weight-
bound cases tighten. Legacy path verified unaffected.
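
A compact Python model of the slot accounting, assuming one matmul consumer per preload (the refcounted WeightToken is omitted). WeightSlots and its methods are hypothetical names; only the release time finish_cycle - overlapping_cycle and the cycle-keyed queue follow the message above.

```python
from collections import defaultdict


class WeightSlots:
    """Per-SA weight slots with a cycle-keyed release queue (illustrative)."""

    def __init__(self, num_sa, depth=2):
        self.free = [depth] * num_sa
        self._next = 0                       # round-robin cursor
        self.release_q = defaultdict(list)   # cycle -> SAs to give a slot back

    def acquire(self):
        """Return an SA with a free slot, or None so the preload retries."""
        n = len(self.free)
        for step in range(n):
            sa = (self._next + step) % n
            if self.free[sa] > 0:
                self.free[sa] -= 1
                self._next = (sa + 1) % n
                return sa
        return None

    def schedule_release(self, sa, finish_cycle, overlapping_cycle):
        """Free at the end of the streaming phase, not at full finish."""
        self.release_q[finish_cycle - overlapping_cycle].append(sa)

    def process_releases(self, now):
        """Drain every release whose cycle has been reached."""
        for due in sorted(c for c in self.release_q if c <= now):
            for sa in self.release_q.pop(due):
                self.free[sa] += 1


slots = WeightSlots(num_sa=1, depth=1)
sa = slots.acquire()
stalled = slots.acquire()            # no free slot: the preload would retry
slots.schedule_release(sa, finish_cycle=100, overlapping_cycle=20)
slots.process_releases(now=79)
still_full = slots.acquire()
slots.process_releases(now=80)       # streaming ends at cycle 80
reacquired = slots.acquire()
```

The slot comes back at cycle 80 rather than 100, so the next preload does not wait out the drain tail.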

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
… uses it

The trace timeline drew matmul/preload onto sa0/sa1 by round-robin of issue order,
which no longer matches reality now that the weight-buffer throttle pins each
matmul to the SA its weight was preloaded into. That made a store look like it
issued before a (mis-assigned) SA lane finished, when the model is in fact correct
(the compute barrier drains all SAs before the store issues).

Expose the SA the Core actually used (it is already recorded on the instruction
by the throttle):
- CoreTraceLog: add sa=<idx> to the COMP issue/finish detail line (-1 for vector).
- trace_timeline.py: place each SA op on the lane it reports (sa= field) and
  auto-split sa0..saN from it; round-robin stays as the fallback for older logs.

Lanes now reflect the real per-SA schedule and the store cleanly follows both SAs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
A core running several work-items (dispatches) over-serialized: the COMPUTE_BAR
before a store finished only once EVERY systolic array + the VPU had drained,
regardless of which dispatch's matmuls occupied them. So one tile's store waited
for an unrelated later tile's compute too -- on a 2-core 2048^3 GEMM, tile 0's
store issued at 71619 (after tile 2's compute) when tile 0's own compute finished
at 37536.

Make the fence drain only the computes it gates: each async compute, when it
issues, feeds its finish_cycle to its COMPUTE_BAR pipeline-child
(update_fence_finish, folded into the existing release_pipeline_children loop so
no extra pass), and the bar finishes once core_cycle reaches that max --
independent of other dispatches sharing the SA pipelines.

- Instruction: _fence_finish + update_fence_finish/get_fence_finish. Also carries
  a per-op work-item id (_tile_group) used by the trace/timeline in the next commit.
- Core: COMPUTE_BAR waits core_cycle >= fence_finish instead of all-SA-empty;
  finish_cycle is computed before release_pipeline_children so the fence is fed.

2048^3 (2 tiles/core): tile 0's store now issues at 38826 (right after its own
compute) and stores overlap the next tile: 91648 -> 81883 cycles (~10.7% fewer).
Compute-bound 4096^3 unchanged (442023). Legacy unaffected (builds no COMPUTE_BAR).
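
The per-fence drain can be sketched in a few lines of Python. ComputeBar is a hypothetical stand-in for the barrier Instruction; the 37536 value is tile 0's finish cycle from the message above, and 30000 is an invented earlier matmul.

```python
class ComputeBar:
    """A compute fence that drains only the computes that feed it."""

    def __init__(self):
        self.fence_finish = 0

    def update_fence_finish(self, finish_cycle):
        """Called when a gated compute issues; keeps the latest finish cycle."""
        self.fence_finish = max(self.fence_finish, finish_cycle)

    def ready(self, core_cycle):
        return core_cycle >= self.fence_finish


bar_tile0 = ComputeBar()
for finish in (30000, 37536):        # tile 0's own matmuls
    bar_tile0.update_fence_finish(finish)
# A later tile's matmul never calls update_fence_finish on bar_tile0,
# so it cannot delay tile 0's store.
ready_early = bar_tile0.ready(37535)
ready_on_time = bar_tile0.ready(37536)
```

Because the fence only tracks the maximum finish cycle it was fed, other dispatches sharing the SA pipelines have no effect on when it clears.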

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
Emit the dispatch work-item id on every trace instruction and use it in the
timeline viewer, and rework the DMA lanes so individual loads are legible.

- togsim_trace_bridge: stamp each instruction with its work-item index
  (cur_tile_group, bumped per TILE_BEGIN).
- CoreTraceLog: add tile= to the COMP / DMA / MEMORY_BAR detail lines (-1 for
  legacy, which has no work-item).
- trace_timeline.py:
  - color each slice by its tile (work-item) so one output tile's load / preload
    / matmul / store share a color across lanes and cores;
  - split the single dma lane into mvin / mvin-r / mvout: injection
    [issued, async] vs response [async, data-ready], so DRAM-response timing
    (shared-bandwidth contention) is visible separately from per-core injection;
  - serialize the injection on one DMA engine (server replay) so a load's bar is
    the engine time it actually uses, not iss->async inflated by queue wait;
  - label each DMA slice by its own addr_name so input / weight / K-panel loads
    stay distinct (tile is conveyed by color).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
The timeline needs the moment data starts arriving, not just when the last byte
lands (DRAM_RESP_DONE). The trace only had ASYNC_DMA_ISSUE (all requests injected)
and the final response, so a load's data window had to start at injection-done,
missing data that returns while the load is still injecting.

Emit DRAM_RESP_FIRST the first time an op's DRAM response arrives (a one-shot flag
on the Instruction, set in push_memory_response). The viewer then draws a load's
read-bandwidth bar from its first response to data-ready -- the real data-arrival
window, including bytes that came back during injection.

- TraceLogTags: kFirstDramResponse = "DRAM_RESP_FIRST".
- Instruction: _got_first_response one-shot flag + got/mark accessors.
- Core: log it on the first push_memory_response of an op.
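
A one-shot flag of this kind might look like the following in Python. Load and the tuple-shaped log are hypothetical; only the DRAM_RESP_FIRST tag and the _got_first_response name come from the commit.

```python
class Load:
    """Minimal stand-in for a load instruction with a one-shot response flag."""

    def __init__(self, name):
        self.name = name
        self._got_first_response = False


def push_memory_response(load, cycle, log):
    """Log DRAM_RESP_FIRST only for the first response of this load."""
    if not load._got_first_response:
        load._got_first_response = True
        log.append((cycle, "DRAM_RESP_FIRST", load.name))


log = []
load_a = Load("A_tile")
for cycle in (10, 12, 15):           # three responses for the same load
    push_memory_response(load_a, cycle, log)
```

Only the response at cycle 10 is logged, which is the left edge of the data-arrival window the viewer draws.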

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
Rework the DMA view along GPU-profiler lines, replacing the load-lifetime bars
(which overlapped and folded in queue wait) with bandwidth-resource lanes:
- dram-rd: a load's read bar [first DRAM response, data-ready] = the real
  data-arrival window (uses DRAM_RESP_FIRST), serialized on the aggregate
  bandwidth so each load is one visible bar (packed row = saturated bus).
- dram-wr: a store's write bar [issued, finished] -- writes go out with the
  request (fire-and-forget; acks land after the store has finished), so this, not
  the ack window, is the transfer.
- drop the dma-eng injection lane (the engine queue is not the bottleneck) and the
  in-flight counter.
Each DMA slice keeps its own addr_name label and tile color, so input / weight /
K-panel loads stay distinct and one output tile's ops share a color.

A saturated dram-rd against a half-idle SA reads as memory-bound at a glance: the
2-core 4096^3 GEMM shows dram-rd ~100% / SA ~59%, and doubling DRAM channels flips
it (SA ~100%, 442023 -> 277603 cycles).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
YWHyuk force-pushed the feature/togsim-cpp-trace branch from 1151f6a to 7f70bbb on June 22, 2026 12:13
YWHyuk and others added 4 commits June 22, 2026 22:46
The minimal Instruction(Opcode) constructor used for barriers (MEMORY_BAR,
COMPUTE_BAR) left ready_counter uninitialized, while the full constructor sets it
from num_parents. A barrier accumulates its count via inc_ready_counter from each
parent starting from that garbage value, so dec_ready_counter never returns to 0
unless the garbage was already 0. The barrier then never becomes ready and the
kernel never completes -- the frozen-state guard fires with a misleading
"spad too small" message. Whether the garbage was 0 depended on process memory
layout (env size such as the presence of TORCHSIM_DIR shifts it), making the
wedge a non-deterministic heisenbug.

Give ready_counter a default member initializer of 0 so the barrier ctor starts
from a correct base.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01QwQDvWo2McMjTEuNWQSA4f
Emit the trace producer .so on every compile (best-effort) and drive the
standalone TOGSim run from it by default; the legacy ONNX TOG is now the opt-in
fallback via TORCHSIM_LEGACY_TOG=1. Previously the .so was emitted only under
TORCHSIM_DUMP_TRACE_SO=1 and run only under TORCHSIM_RUN_TRACE=1, so the existing
test suite never exercised the C++ TOG. Autotune candidates still run legacy (the
.so is a single tiling); the trace path drives the final chosen-tiling run, and a
missing .so falls back to legacy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01QwQDvWo2McMjTEuNWQSA4f
codegen_nodes unpacked self.autotune()[:2] into (optimal_src_code, meta_code)
when the strategy is autotune and timing mode is on. autotune returns
[None, None, None] when it cannot autotune -- e.g. a size-1 pointwise kernel whose
ranges == [1], so make_choices yields no candidates -- which clobbered the valid
meta_code (the kernel's arg_attributes) with None. The fall-through then returned
that None, so the generated wrapper passed arg_attributes=None to the cycle-sim
caller and MLIRKernelCallerCodeGen crashed on len(None) (e.g. test_add with a
functional-off timing config). Unpack into a temporary so the original meta_code
survives the no-autotune case.
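
The shape of the fix, reduced to a standalone Python sketch. autotune_stub and pick_kernel are hypothetical stand-ins for self.autotune() and the relevant part of codegen_nodes; the real return values differ.

```python
def autotune_stub(can_autotune):
    """Hypothetical stand-in: returns [None, None, None] when it cannot autotune."""
    return ["tuned_src", {"args": 2}, 123] if can_autotune else [None, None, None]


def pick_kernel(src_code, meta_code, can_autotune):
    """Keep the original src/meta unless autotune actually produced a result."""
    tuned_src, tuned_meta = autotune_stub(can_autotune)[:2]   # temporaries
    if tuned_src is not None:
        return tuned_src, tuned_meta
    return src_code, meta_code


kept = pick_kernel("orig_src", {"args": 1}, can_autotune=False)
tuned = pick_kernel("orig_src", {"args": 1}, can_autotune=True)
```

Unpacking into temporaries means a failed autotune leaves the caller's meta_code intact instead of overwriting it with None.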

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01QwQDvWo2McMjTEuNWQSA4f
The trace path returned without printing the "Simulation finished" marker that
the Python result parser (TOGSimulator.get_result_from_file) searches for, so it
warned "Unable to parse the output file" and returned inf metrics. Print the marker
before the core stats, matching the legacy path's order.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01QwQDvWo2McMjTEuNWQSA4f
Several Instruction members had no default initializer and were left as garbage by
the minimal Instruction(Opcode) barrier constructor (and overlapping_cycle even by
the full constructor): compute_cycle, overlapping_cycle, start_cycle, finish_cycle,
subgraph_id, dram_addr, _tile_numel. Reading garbage from these in occupancy /
weight-release timing produces memory-layout-dependent wedges that only surface
under some process layouts -- the same heisenbug class as the ready_counter fix.
Give them all a default initializer of 0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01QwQDvWo2McMjTEuNWQSA4f